This document explores Tesla’s stock price from its initial public offering on June 29, 2010, through December 31, 2019. Tesla’s share price has risen dramatically over this period, and machine learning algorithms may uncover seasonality or autocorrelation in the series. Predictive analysis, and machine learning in particular, can help insulate investors from some of the uncertainty of the market. This investigation applies machine learning techniques to the historical prices to classify the direction of market movement and to forecast the future value of the stock. The research questions are as follows: can machine learning models classify the direction of Tesla’s daily market movement from historical prices, and can the historical time series be used to forecast the stock’s future value?
The forecasts will be compared to present-day stock values to assess the accuracy of the algorithms, alongside a review of the factors that have affected the stock price to date (e.g., supply and demand, the broader economy, stock splits).
library(quantmod)
library(dplyr)
library(lubridate)
library(summarytools)
library(dvmisc)
library(corrplot)
library(tseries)
library(forecast)
library(ggplot2)
library(plotly)
library(caret)
library(formattable)
library(dygraphs)
library(hrbrthemes)
library(TSstudio)
library(FSelector)
stock_list <- "TSLA"
start_date <- as.Date("2010-06-29")
end_date <- as.Date("2020-01-01")
tesla <- NULL
for (i in seq_along(stock_list)){
getSymbols(stock_list[i], verbose = FALSE, src = "yahoo",
from=start_date,to=end_date)
temp_df = as.data.frame(get(stock_list[i]))
temp_df$Date = row.names(temp_df)
row.names(temp_df) = NULL
colnames(temp_df) = c("Open", "High", "Low", "Close",
"Volume", "Adjusted", "Date")
temp_df = temp_df[c("Date", "Open", "High",
"Low", "Close", "Volume", "Adjusted")]
tesla = temp_df
}
To avoid the effects of the COVID-19 pandemic, the Tesla dataset was restricted to the years 2010-2019. Given the economic uncertainty of 2020, I believe the 2020 data would add considerable noise to the series.
The data analysis stage first consists of cleaning and inspecting the data for inconsistencies. The data may then undergo transformations and modelling as required. As part of the data preparation stage, the data will be inspected, the Date attribute converted to a date type, and missing values checked.
head(tesla)
## Date Open High Low Close Volume Adjusted
## 1 2010-06-29 3.800 5.000 3.508 4.778 93831500 4.778
## 2 2010-06-30 5.158 6.084 4.660 4.766 85935500 4.766
## 3 2010-07-01 5.000 5.184 4.054 4.392 41094000 4.392
## 4 2010-07-02 4.600 4.620 3.742 3.840 25699000 3.840
## 5 2010-07-06 4.000 4.000 3.166 3.222 34334500 3.222
## 6 2010-07-07 3.280 3.326 2.996 3.160 34608500 3.160
str(tesla)
## 'data.frame': 2394 obs. of 7 variables:
## $ Date : chr "2010-06-29" "2010-06-30" "2010-07-01" "2010-07-02" ...
## $ Open : num 3.8 5.16 5 4.6 4 ...
## $ High : num 5 6.08 5.18 4.62 4 ...
## $ Low : num 3.51 4.66 4.05 3.74 3.17 ...
## $ Close : num 4.78 4.77 4.39 3.84 3.22 ...
## $ Volume : num 93831500 85935500 41094000 25699000 34334500 ...
## $ Adjusted: num 4.78 4.77 4.39 3.84 3.22 ...
tesla$Date <- as.Date(tesla$Date)
class(tesla$Date)
## [1] "Date"
The ‘Date’ attribute was changed to represent a date type variable.
sum(is.na(tesla))
## [1] 0
There are no missing values.
Next, a correlation plot will determine whether the attributes are correlated and to what degree.
## Open High Low Close Volume Adjusted
## Open 1.0000000 0.9995588 0.9995524 0.9990113 0.4618287 0.9990113
## High 0.9995588 1.0000000 0.9994779 0.9996208 0.4710375 0.9996208
## Low 0.9995524 0.9994779 1.0000000 0.9995621 0.4526298 0.9995621
## Close 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
## Volume 0.4618287 0.4710375 0.4526298 0.4624999 1.0000000 0.4624999
## Adjusted 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
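The near-perfect correlations among the price columns are expected: open, high, low, and close move together from day to day. A self-contained sketch with synthetic data illustrates the effect:

```r
# Illustration (synthetic data): series that move together, as OHLC columns do,
# produce near-unit Pearson correlations under cor()
set.seed(5)
close <- cumsum(rnorm(500)) + 100        # a random-walk "price"
open  <- close + rnorm(500, sd = 0.1)    # tracks close with small noise
cor(open, close)                         # very close to 1
```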
To understand the attributes further, the measures of central tendency will be reviewed.
Descriptive Statistics of Tesla Stock
| | Min | Q1 | Median | Mean | Q3 | Max | Std.Dev |
|---|---|---|---|---|---|---|---|
| Open | 3.23 | 6.85 | 42.42 | 36.62 | 52.80 | 87.00 | 22.89 |
| High | 3.33 | 6.96 | 43.19 | 37.26 | 53.55 | 87.06 | 23.24 |
| Low | 3.00 | 6.70 | 41.61 | 35.96 | 52.02 | 85.27 | 22.52 |
| Close | 3.16 | 6.87 | 42.32 | 36.63 | 52.78 | 86.19 | 22.90 |
| Volume (M) | 0.59 | 9.37 | 22.65 | 27.17 | 36.26 | 185.82 | 23.60 |
| Adjusted | 3.16 | 6.87 | 42.32 | 36.63 | 52.78 | 86.19 | 22.90 |
These values show the range of values present in the Tesla stock over time. Notably, the spread between the minimum and maximum stock values is quite large, likely reflecting trends across several years. Since this time series begins at the initial public offering, extreme values are more likely genuine long-run changes in the stock’s value than outliers.
For the purposes of the forecasting investigation, the closing stock price (i.e., the value of the stock at the end of the trading day) will be used. The table below displays the trends of the closing stock price from 2010-2019.
Closing Tesla Stock Price Statistics
| Year | Min | Max | Average | % Change per Fiscal Year |
|---|---|---|---|---|
| 2010 | 3.16 | 7.09 | 4.67 | 11.47 |
| 2011 | 4.37 | 6.99 | 5.36 | 7.29 |
| 2012 | 4.56 | 7.60 | 6.23 | 20.62 |
| 2013 | 6.58 | 38.67 | 20.88 | 325.42 |
| 2014 | 27.87 | 57.21 | 44.67 | 48.17 |
| 2015 | 37.00 | 56.45 | 46.01 | 9.44 |
| 2016 | 28.73 | 53.08 | 41.95 | -4.35 |
| 2017 | 43.40 | 77.00 | 62.86 | 43.49 |
| 2018 | 50.11 | 75.91 | 63.46 | 3.83 |
| 2019 | 35.79 | 86.19 | 54.71 | 34.89 |
From this point, the data will be split into training and test sets. However, prior to splitting the data, a response variable to track the trends in the market must be calculated. In the case of reviewing the trends of the stock market, the stock’s return will be categorized into one of five classes, representing the quintiles of the daily returns.
Further analysis will be conducted on the training set.
# Generating the daily return
for (i in 2:nrow(tesla)){
tesla$Return[i] <-
(tesla$Close[i] - tesla$Close[i-1])/tesla$Close[i-1]
}
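For reference, the same daily return can be computed without an explicit loop; a minimal sketch on a synthetic price vector:

```r
# Vectorized daily simple return: (P_t - P_{t-1}) / P_{t-1}
close <- c(100, 102, 99, 105)                # synthetic closing prices
ret <- c(NA, diff(close) / head(close, -1))  # NA for the first day
round(ret[-1], 4)                            # 0.0200 -0.0294 0.0606
```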
# Assigning a response variable
normalize <- function(x){
return ((x - min(x)) / (max(x) - min(x)))
}
tesla <- na.omit(tesla)
tesla$Return <- normalize(tesla[,8])
tesla$Response <- quant_groups(tesla$Return, groups = 5)
## Observations per group: 479, 478, 479, 478, 479. 0 missing.
summary(tesla$Response)
## [0,0.398] (0.398,0.432] (0.432,0.457] (0.457,0.493] (0.493,1]
## 479 478 479 478 479
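quant_groups (from dvmisc) bins a numeric vector by its empirical quantiles; an equivalent base-R sketch:

```r
# Quintile binning with base R: cut at the 0/20/40/60/80/100% quantiles
set.seed(42)
x <- runif(1000)
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
groups <- cut(x, breaks = breaks, include.lowest = TRUE)
table(groups)   # 200 observations per bin
```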
# Setting training and test sets
train.set <- tesla %>%
filter(Date >= as.Date("2010/06/30") & Date <= as.Date("2016/12/31"))
test.set <- tesla %>%
filter(Date >= as.Date("2017/01/01") & Date <= as.Date("2019/12/31"))
Next, visualizations will be used to observe trends in the data. To ensure that the most accurate forecasts are obtained from the analysis, there are several aspects to consider when working with a time-series dataset. The following data exploration will examine the distribution of the closing prices, stationarity, autocorrelation, and the presence of seasonality.
The first two visualizations will display the closing price of Tesla stock per day.
Histogram of Closing Price
Now that the closing stock prices have been visualized, it is clear that the data is not normally distributed. Further testing for stationarity is shown below.
Testing for Stationarity:
The Augmented Dickey-Fuller Test is a statistical test for stationarity; its null hypothesis is that the series is non-stationary. A p-value below 0.05 indicates stationary data, meaning the series has a constant mean and constant variance over time.
adf.test(train.set$Close)
##
## Augmented Dickey-Fuller Test
##
## data: train.set$Close
## Dickey-Fuller = -2.0211, Lag order = 11, p-value = 0.5694
## alternative hypothesis: stationary
Based on the p-value of 0.5694, the Tesla time series is non-stationary. Differencing will be used to address this. The ndiffs function calculates the number of differences required to make the time series stationary, and differencing itself can be performed with the diff function. The visualizations below will help determine which transformation best suits the data, and the differenced data will then undergo the Augmented Dickey-Fuller Test again.
ndiffs(train.set$Close, test = "adf")
## [1] 1
Based on the plots above, the log-differenced data appears to be the best-performing transformation for stabilizing the variance of the time series.
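The effect of the log-difference transformation can be seen on a purely multiplicative series, where it yields a constant:

```r
# diff(log(x)) turns constant multiplicative growth into a constant series
x <- 100 * 1.01^(0:10)            # 1% growth per step
d <- diff(log(x))
all(abs(d - log(1.01)) < 1e-12)   # TRUE: every increment equals log(1.01)
```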
adf.test(diff(log(train.set$Close)))
## Warning in adf.test(diff(log(train.set$Close))): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(log(train.set$Close))
## Dickey-Fuller = -11.137, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
Based on the differencing of 1 lag and a logarithmic transformation, the time series is now a stationary process.
Autocorrelation:
It is important to determine whether autocorrelation is present in the data. Autocorrelation refers to the degree of linear similarity between the series and a lagged version of itself. In other words, this assessment determines whether past observations influence the current ones. The following tests compare the original and stationary data for the presence of autocorrelation.
The above plot (right) shows that the autocorrelation falls to zero quickly, meaning the data is now stationary. The autocorrelation stays within the dashed blue bounds, indicating that no lag remains significantly correlated with the data series. The partial autocorrelation plot below confirms this conclusion.
pacf(diff(log(train.set$Close)))
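The pattern described above can be reproduced on a simulated random walk, whose autocorrelation decays slowly until the series is differenced; a self-contained sketch:

```r
# Lag-1 autocorrelation before and after differencing a random walk
set.seed(7)
rw <- cumsum(rnorm(300))             # non-stationary random walk
acf(rw, plot = FALSE)$acf[2]         # near 1: strong persistence
acf(diff(rw), plot = FALSE)$acf[2]   # near 0: differencing removes it
```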
Presence of Seasonality:
# 253 is the average number of trading days per year
train_diff <- train.set %>%
mutate(Logdiff = c(NA, diff(log(Close)))) %>%
na.omit()
tesla_ts <- xts(train_diff$Logdiff, train_diff$Date)
ts_info(tesla_ts)
## The tesla_ts series is a xts object with 1 variable and 1638 observations
## Frequency: daily
## Start time: 2010-07-01
## End time: 2016-12-30
temp <- ts(tesla_ts,start=c(2010,1),freq=253)
ts_decompose(temp)
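ts_decompose (from TSstudio) performs the same trend/seasonal/remainder split as base R’s decompose; a self-contained sketch on a synthetic seasonal series:

```r
# Classical decomposition of a synthetic series: trend + seasonality + noise
set.seed(3)
x <- ts(0.05 * (1:120) + sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 0.2),
        frequency = 12)
dec <- decompose(x)
names(dec)   # includes "seasonal", "trend", "random"
```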
Normalization:
train_norm <- normalize(train.set[,c(2:5)])
train_norm$Response <- train.set$Response
Additional technical indicators will be calculated to strengthen the information available to the models (the second ARIMA fit below uses one of them as an external regressor). These indicators include the following:
# Each TTR indicator is vectorized, so no loop is needed
#MACD (keep the macd column; the signal line is also returned)
train_diff$MACD <- MACD(Cl(train_diff), nFast=12, nSlow=26, nSig=9)[, "macd"]
#RSI
train_diff$RSI <- RSI(Cl(train_diff), n=14)
#Price Rate of Change
train_diff$ROC <- ROC(Cl(train_diff), n=14) * 100
#Simple Moving Average
train_diff$SMA <- SMA(Cl(train_diff), n=14)
hlc <- data.frame(High=Hi(train_diff), Low=Lo(train_diff), Close=Cl(train_diff))
#Stochastic Oscillator (keep the fast %K column)
train_diff$STO <- stoch(hlc, nFastK = 14)[, "fastK"] * 100
#William's %R
train_diff$WPR <- WPR(hlc, n=14) * (-100)
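Of these indicators, the simple moving average is the easiest to verify by hand; it is just a one-sided rolling mean. A base-R sketch of what TTR::SMA computes for n = 3:

```r
# 3-period simple moving average via a one-sided convolution filter
sma <- function(x, n) stats::filter(x, rep(1 / n, n), sides = 1)
as.numeric(sma(1:6, 3))   # NA NA then the rolling means 2 3 4 5
```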
ARIMA model:
set.seed(10)
arima_model <- auto.arima(train_diff$Logdiff, ic = c("aicc", "aic", "bic"))
arima_model2 <- auto.arima(train_diff$Logdiff, ic = "aic", xreg = train_diff$ROC)
K-Nearest Neighbours:
set.seed(1)
knn_model <- train(Response ~ High + Low + Open + Close,
data = train_norm,
method = "knn")
predictions <- predict(knn_model, newdata = test.set)
confusionMatrix(predictions, test.set$Response)
## Confusion Matrix and Statistics
##
## Reference
## Prediction [0,0.398] (0.398,0.432] (0.432,0.457] (0.457,0.493]
## [0,0.398] 0 0 0 0
## (0.398,0.432] 1 2 2 1
## (0.432,0.457] 70 78 79 73
## (0.457,0.493] 33 28 36 39
## (0.493,1] 49 40 33 46
## Reference
## Prediction (0.493,1]
## [0,0.398] 0
## (0.398,0.432] 2
## (0.432,0.457] 70
## (0.457,0.493] 34
## (0.493,1] 38
##
## Overall Statistics
##
## Accuracy : 0.2095
## 95% CI : (0.181, 0.2404)
## No Information Rate : 0.2109
## P-Value [Acc > NIR] : 0.5499
##
## Kappa : 0.0126
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: [0,0.398] Class: (0.398,0.432]
## Sensitivity 0.0000 0.013514
## Specificity 1.0000 0.990099
## Pos Pred Value NaN 0.250000
## Neg Pred Value 0.7971 0.804290
## Prevalence 0.2029 0.196286
## Detection Rate 0.0000 0.002653
## Detection Prevalence 0.0000 0.010610
## Balanced Accuracy 0.5000 0.501806
## Class: (0.432,0.457] Class: (0.457,0.493]
## Sensitivity 0.5267 0.24528
## Specificity 0.5182 0.77983
## Pos Pred Value 0.2135 0.22941
## Neg Pred Value 0.8151 0.79452
## Prevalence 0.1989 0.21088
## Detection Rate 0.1048 0.05172
## Detection Prevalence 0.4907 0.22546
## Balanced Accuracy 0.5224 0.51256
## Class: (0.493,1]
## Sensitivity 0.2639
## Specificity 0.7246
## Pos Pred Value 0.1845
## Neg Pred Value 0.8066
## Prevalence 0.1910
## Detection Rate 0.0504
## Detection Prevalence 0.2732
## Balanced Accuracy 0.4942
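The overall statistics above come straight from the confusion matrix; accuracy and Cohen’s kappa can be computed by hand on a toy two-class table:

```r
# Accuracy and Cohen's kappa from a confusion matrix
cm <- matrix(c(50, 10, 5, 35), nrow = 2)                 # toy 2-class counts
accuracy <- sum(diag(cm)) / sum(cm)                      # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2   # chance agreement
kappa <- (accuracy - expected) / (1 - expected)
c(accuracy = accuracy, kappa = round(kappa, 3))          # 0.85 and 0.694
```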
Feature selection will be used to find the best model for the classification algorithm. This process determines which attribute(s) best predict the response of the next day’s market. The ARIMA model uses the Akaike information criterion (AIC) to select the best-fitting model. Two other feature selection methods, a filter-based and a wrapper-based technique, are shown below.
# Filter-based Feature Selection Technique
# Correlation Based
cfs(Response ~ High + Low + Open + Close, data = train_norm)
## [1] "High"
# Wrapper-based Feature Selection Technique (recursive feature elimination)
# Requires a control object first, e.g.:
# control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
#rfe(x = train_norm[, 1:4], y = train_norm[, 5], rfeControl = control, metric = "ROC")